feat: partial weight sync (delta + selective) #1806

Open

nanjiangwill wants to merge 8 commits into main from delta-compression-feature

Conversation


@nanjiangwill nanjiangwill commented Apr 5, 2026

Summary

Non-colocated weight sync that broadcasts only the changed-position payload over NCCL instead of full weights. Two peer modes, selectable at runtime:

  • selective — broadcast new param values at changed positions only, with NaN as the "unchanged" sentinel; the receiver overwrites those positions and leaves the rest alone. Lossless by construction (no arithmetic); the values portion of the wire is ~½ the size of an fp32 delta. Inspired by arXiv:2509.19128.
  • delta — broadcast (current − snapshot), sparse-encoded; the receiver applies it additively. Lossless with --update-weight-delta-dtype fp32 (the default). Inspired by Cursor Composer 2 and Fireworks AI — Frontier RL Is Cheaper Than You Think. Both compute steps are sketched below.
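
A minimal sketch of the sender-side compute step for both modes (illustrative only; the real logic lives in `_enqueue_partial_chunk`'s compute paths, and the helper name below is a stand-in):

```python
import torch

def compute_payload(mode, current, snapshot, delta_dtype=torch.float32):
    # positions where the trainer's weights moved since the last snapshot
    mask = current != snapshot
    if mode == "selective":
        # ship the new values themselves: no arithmetic, lossless by construction
        payload = current[mask]
    else:  # "delta"
        # ship (current - snapshot) at delta_dtype; fp32 makes the apply lossless
        payload = (current.to(delta_dtype) - snapshot.to(delta_dtype))[mask]
    return payload, mask  # sparse-encoded (indices or bitmask) before broadcast
```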

Measured on GLM-4.7-355B-A32B non-colocated (8 actor + 8 rollout nodes, 64 rollout H100s) in delta mode: the wire shrinks ~30× (~170 GB → ~5.9 GB) and the broadcast stops dominating the sync phase. Per-sync density logs at 2–3%, below the 1/32 ≈ 3.125% break-even at which an int32 index per changed element starts costing more than a one-bit-per-position bitmask, so sparse_indices is the right encoding for this workload.
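
The break-even falls out of the index overhead alone, since the value bytes are identical under either encoding; a back-of-envelope check (assumptions: int32 indices, 1-bit bitmask):

```python
def index_overhead_bytes(numel: int, density: float) -> int:
    return int(numel * density) * 4   # sparse_indices: one int32 per changed element

def bitmask_overhead_bytes(numel: int) -> int:
    return numel // 8                 # sparse_bitmask: one bit per position

# 4 * density * N < N / 8  <=>  density < 1/32 = 3.125%
N = 1_000_000
for density in (0.02, 0.03, 1 / 32, 0.05):
    cheaper = index_overhead_bytes(N, density) < bitmask_overhead_bytes(N)
    print(f"density {density:.4f}: sparse_indices cheaper? {cheaper}")
```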

CLI surface

Trainer side (slime):

| flag | values | applies to |
| --- | --- | --- |
| `--update-weight-mode` | `full` / `selective` / `delta` | universal — picks the strategy |
| `--update-weight-delta-dtype` | `fp16` / `bf16` / `fp32` | delta only (math precision; ignored otherwise) |
| `--update-weight-partial-encoding` | `sparse_indices` / `sparse_bitmask` / `dense` | selective + delta |
| `--update-weight-base-sync-interval` | int | selective + delta (set very large to disable) |
| `--update-weight-partial-artifact-dir` | path | selective + delta (optional async writer) |

SGLang side (auto-mirrored via --sglang- prefix):

| flag | purpose |
| --- | --- |
| `--sglang-update-weight-partial-chunk-bytes` | byte cap per `model.load_weights` call on apply |

--update-weight-mode={selective,delta} is rejected with --colocate — CUDA IPC has no wire to compress.
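
The surface maps onto argparse roughly as follows (a hedged sketch: the choices mirror the tables above, the fp32 and 9999 defaults are stated elsewhere in this PR, and the remaining defaults plus the parser wiring are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()  # hypothetical; slime's real parser differs
parser.add_argument("--update-weight-mode",
                    choices=["full", "selective", "delta"], default="full")
parser.add_argument("--update-weight-delta-dtype",           # delta only
                    choices=["fp16", "bf16", "fp32"], default="fp32")
parser.add_argument("--update-weight-partial-encoding",      # selective + delta
                    choices=["sparse_indices", "sparse_bitmask", "dense"],
                    default="sparse_indices")
parser.add_argument("--update-weight-base-sync-interval",    # very large disables
                    type=int, default=9999)
parser.add_argument("--update-weight-partial-artifact-dir",  # optional async writer
                    type=str, default=None)
```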

Code shape

Slime: UpdateWeightFromDistributedPartial extends UpdateWeightFromDistributed via a single _on_chunk hook on the base class (Template Method). Base behaviour is otherwise unchanged. One subclass for both modes; mode branching lives in _enqueue_partial_chunk (compute_selective vs compute_delta) and the load_format string passed to the receiver.
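
The seam, roughly (a sketch; only `_on_chunk` and `_enqueue_partial_chunk` are names from this PR, the other methods are placeholders):

```python
class UpdateWeightFromDistributed:
    def _nccl_broadcast(self, chunk):
        pass  # stand-in for the real NCCL broadcast path (unchanged by this PR)

    def _broadcast_chunk(self, chunk):
        chunk = self._on_chunk(chunk)   # the single Template Method hook
        self._nccl_broadcast(chunk)

    def _on_chunk(self, chunk):
        return chunk                    # full sync: ship weights as-is

class UpdateWeightFromDistributedPartial(UpdateWeightFromDistributed):
    def _enqueue_partial_chunk(self, chunk):
        # compute_selective vs compute_delta branching and sparse encoding live
        # here; the encoded payload replaces the dense chunk before broadcast
        return chunk

    def _on_chunk(self, chunk):
        return self._enqueue_partial_chunk(chunk)
```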

SGLang patch (in docker/patch/latest/sglang.patch):

  • Wire protocol: single PartialWeightSpec + PartialWeightEncoding + PartialWeightParam. Receiver dispatches on load_format ("selective" vs "delta").
  • Receiver: one entry point _update_partial_weights_from_distributed(..., mode) that picks the fill_value (NaN vs 0) and apply context (selective vs additive) inline. Decode is lazy per-param, peak HBM bounded by encoded_buffers + chunk_byte_cap.
  • Apply contexts: _selective_load_context (selective) and _additive_load_context (delta), sharing a _param_storage_predicate + _patched_in_place_writes helper. Both rewire Tensor.copy_ / fill_ scoped to writes whose destination is inside model param storage (bisect on captured data_ptr ranges).
  • Selective apply: param[~isnan(src)] = src[~isnan(src)] — overwrites changed positions, leaves NaN-marked positions untouched (both apply rules are sketched after this list).
  • Delta apply: PyTorch's in-place add_ auto-promotes bf16+=fp32 to fp32 math and casts back on store, so deltas keep fp32 precision without an explicit cast.
  • post_load_weights (DeepSeek's w_kc/w_vc materialization) is wrapped to run with unmodified copy_/fill_, so derived tensors overwrite correctly under both contexts.
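
The two apply rules, stripped of the copy_/fill_ rewiring for clarity (a sketch; the real patch applies these inside load_weights via the contexts above):

```python
import torch

def selective_apply(param: torch.Tensor, src: torch.Tensor) -> None:
    # src is dense with NaN marking "unchanged"; overwrite only the real values
    changed = ~torch.isnan(src)
    param[changed] = src[changed].to(param.dtype)

def delta_apply(param: torch.Tensor, delta_fp32: torch.Tensor) -> None:
    # in-place add_ promotes bf16 += fp32 to fp32 math, rounds back on store
    param.add_(delta_fp32)
```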

Results

| run | sync time | wire | notes |
| --- | --- | --- | --- |
| full sync baseline | ~50 s | 170 GB | dense reference |
| selective sync (`sparse_indices`) | TBD | TBD | pending — experiment running |
| delta sync (`sparse_indices`) | ~22.6 s | ~5.9 GB | density ~2–3%, 598 chunks |

Selective mode

Pending — experiment running. The example README has a placeholder section for selective traces and per-sync density/wall-clock numbers; it will be filled in once the run completes.

Delta mode

W&B traces are in examples/partial_weight_sync/README.md (raw_reward / train-rollout logprob abs diff / update_weights_time).

Why this composes with future communication work

The feature only changes what bytes get shipped. The NCCL broadcast, Ray lock, bucket scheduling, and send/receive layers are untouched. Any future slime improvement to the weight-update communication path (better compute/broadcast overlap, pipeline-parallel sends, NIC-level tricks) stacks additively on top of the speedups here — both modes inherit it for free.

🤖 Generated with Claude Code

@nanjiangwill nanjiangwill marked this pull request as draft April 5, 2026 04:33
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch 5 times, most recently from d2aa1c0 to 5fb928d on April 18, 2026 01:26
@nanjiangwill nanjiangwill marked this pull request as ready for review April 18, 2026 01:29
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch from 23b9059 to b056c46 on April 30, 2026 22:27
Add a non-colocated weight-sync mode that broadcasts (current - snapshot)
sparse-encoded over NCCL and applies it additively on SGLang, instead of
broadcasting every parameter on every step. At GLM-4.7-355B scale the wire
shrinks ~30x (typical 2-3% density) and the sync phase becomes dominated by
gather+convert rather than broadcast.

Slime side:
  - --update-weight-mode {full,delta}: pick the sync strategy.
  - --delta-compression {sparse_indices,sparse_bitmask,dense}: wire encoding.
  - --delta-dtype {fp16,bf16,fp32}: subtraction + apply happen at this dtype.
  - --delta-full-interval N: periodic full sync (first sync is always full).
  - --delta-artifact-dir PATH: optional async per-chunk artifact writer.
  - Rejects --update-weight-mode delta with --colocate (CUDA IPC has no wire).
  - UpdateWeightFromDistributedDelta extends UpdateWeightFromDistributed via
    a single _on_chunk hook on the base class (template method); the base
    class is otherwise unchanged in behaviour.

SGLang side (slime patch):
  - WeightDeltaSpec / WeightDeltaParam / WeightDeltaEncoding wire protocol
    in io_struct.
  - Receiver decodes lazily (per-param), then applies via load_weights with
    Tensor.copy_/fill_ rewired to add_ only inside param storage ranges.
  - In-place add_ between bf16 param and fp32 delta auto-promotes for math
    and casts back on store, so deltas keep fp32 precision without an extra
    cast or scratch allocation.

Adds:
  - examples/delta_compression/ with a non-colocated GLM-4.7-355B launcher.
  - docs/en/advanced/delta-compression.md + zh translation.
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch from b1c4ae1 to db09918 on May 13, 2026 02:41
@nanjiangwill nanjiangwill changed the title from "feat: delta compression for weight sync" to "feat: delta-compression weight sync" on May 13, 2026
- examples/README.md: drop "at 355B scale" from the general directory listing.
- examples/delta_compression/README.md: replace "(typical 2-3% density at
  355B)" with a model-agnostic note about RL fine-tuning density.
- docs/{en,zh}/advanced/delta-compression.md: same de-anchoring in the
  overview paragraph; specific numbers stay in result tables only.
- examples/delta_compression/run-glm4.7-355B-A32B-delta.sh: drop unused
  SCRIPT_DIR and the delta-flavoured comment on --update-weight-buffer-size
  (the flag is general).
…plain lossless

Replace every --delta-full-interval 10000 occurrence with 30 (the actual
argparse default) and add an inline note at each site: setting the flag to a
very large integer (e.g. 10000) effectively disables periodic full syncs,
which is fine because with --delta-dtype fp32 the apply is lossless — every
bf16 value is exactly representable in fp32, the subtraction captures the
exact difference between two stored bf16 values, and the receiver's in-place
bf16 += fp32 add reproduces the trainer's bf16 state bit-for-bit on rounding
back, so no error accumulates across deltas.
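
That argument is easy to sanity-check in isolation (a standalone sketch, not project code):

```python
import torch

old = torch.randn(1024, dtype=torch.bfloat16)   # receiver's current weights
new = torch.randn(1024, dtype=torch.bfloat16)   # trainer's weights after a step

# every bf16 value is exactly representable in fp32, so the fp32 subtraction
# captures the difference between the two stored bf16 states
delta = new.float() - old.float()

# receiver's in-place bf16 += fp32: fp32 math, rounded back to bf16 on store
rollout = old.clone()
rollout.add_(delta)
assert torch.equal(rollout, new)                # bit-for-bit, no drift
```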

Updates: example script, example README, en + zh docs.
…artial*

Add a second partial-update mode 'selective' alongside the existing 'delta'.
Both share the snapshot, sparse encoding, periodic-base-sync, and bucketed
broadcast machinery; they differ in what's on the wire and how the receiver
applies it:

  delta     — wire = (current − snapshot) at delta_dtype;
              receiver: param += delta (in-place add, auto-promotes for fp32
              math, casts back to param dtype on store).
  selective — wire = new param values at changed positions, with NaN as the
              "unchanged" sentinel in the dense decoded tensor;
              receiver: param[~isnan(src)] = src[~isnan(src)] (selective
              overwrite, leaves NaN positions untouched).

Selective is lossless by construction (no arithmetic), the wire values portion
is ~½ the size of fp32 delta, and the per-element apply is a direct masked
copy. Selective requires float param dtype on the wire (validated at decode);
slime always sends HF-format floats so this holds in practice.

CLI rename: every flag now lives under --update-weight-*. Scope is encoded in
the name itself:
  --update-weight-mode {full, delta, selective}
  --update-weight-delta-dtype                  # delta-only
  --update-weight-partial-encoding             # delta + selective
  --update-weight-base-sync-interval           # delta + selective
  --update-weight-partial-artifact-dir         # delta + selective
  --sglang-update-weight-partial-chunk-bytes   # delta + selective (receiver)

Internal naming follows: WeightDelta* → PartialWeight* on the wire,
UpdateWeightFromDistributedDelta → UpdateWeightFromDistributedPartial,
DeltaSync → PartialSync, _decode_sparse_delta → _decode_sparse_partial,
_additive_load_context stays + new _selective_load_context (they share a
_param_storage_predicate helper + _patched_in_place_writes scaffolding).
File: update_weight_from_distributed_delta.py → ..._partial.py.

The slime orchestrator is one class (UpdateWeightFromDistributedPartial) with
mode branching in _enqueue_partial_chunk; _send_partial_weights body is
shared. The SGLang receiver dispatches on load_format ("delta" or "selective")
to one of two thin entry points that share _update_partial_weights_from_distributed.

Wire schema: single PartialWeightSpec class shared by both modes (structurally
identical wire format); request field is .partial (was .delta).

Docs (en + zh) and the example script use the new flag names and document
selective mode without perf numbers (pending experiment).
The feature now covers two peer modes (delta + selective), so 'delta
compression' is misleading at the umbrella level. Rename to 'partial weight
sync':

  docs/{en,zh}/advanced/delta-compression.md → partial-weight-sync.md
  examples/delta_compression/                 → examples/partial_weight_sync/
  examples/.../run-glm4.7-355B-A32B-delta.sh  → ..._partial.sh

Internal references and index.rst entries updated. The example script and
README now show two explicit PARTIAL_ARGS blocks (delta active by default,
selective commented out) so users can flip modes by swapping which block is
uncommented.

Adds inspiration / prior-art references (Cursor Composer 2 + Fireworks AI
for delta; arXiv:2509.19128 for selective) and a placeholder for selective
W&B traces. Also documents that this feature only changes what bytes ship
on the wire, so any future communication-layer optimization in slime stacks
additively on top.
@nanjiangwill nanjiangwill changed the title from "feat: delta-compression weight sync" to "feat: partial weight sync (delta + selective)" on May 13, 2026
nanjiangwill and others added 3 commits May 13, 2026 05:00
…sh sglang.patch

- `--update-weight-base-sync-interval` default: 30 → 9999. Both partial modes
  are lossless under their defaults (delta with fp32 math, selective by
  construction), so periodic base syncs aren't needed for correctness. Help
  text and docs/examples updated to explain the override (e.g. 30 to verify
  against periodic full broadcasts).
- Refresh docker/patch/latest/sglang.patch with the renamed CLI / wire types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trainer-side observability for partial weight sync. New wandb metrics on
the rollout/step axis, emitted by UpdateWeightFromDistributedPartial:

  perf/update_weights_density       fraction of positions that moved
  perf/update_weights_is_base_sync  0/1 flag (lets you disambiguate spikes)
  perf/update_weights_wire_bytes    bytes actually shipped per sync

Plumbing: weight updaters gain an update_weight_metrics dict + pop_metrics().
The actor drains it via log_perf_data(..., extra_metrics=...), mirroring
slime's existing rollout_extra_metrics pattern in _log_rollout_data.
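
A sketch of the drain pattern (metric keys from this commit; the class shape around pop_metrics is illustrative):

```python
class PartialSyncMetricsMixin:
    """Hypothetical stand-in for the updater's metrics plumbing."""

    def __init__(self):
        self.update_weight_metrics = {}

    def record_sync(self, density, is_base_sync, wire_bytes):
        self.update_weight_metrics = {
            "perf/update_weights_density": density,
            "perf/update_weights_is_base_sync": int(is_base_sync),
            "perf/update_weights_wire_bytes": wire_bytes,
        }

    def pop_metrics(self):
        metrics, self.update_weight_metrics = self.update_weight_metrics, {}
        return metrics

# actor side, once per rollout step (log_perf_data signature assumed):
# log_perf_data(step, extra_metrics=updater.pop_metrics())
```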

Other changes:
- nnz field on PartialChunk so density accounting is symmetric with byte_size
- Reorder selective before delta in all prose/tables/help text/example script
  (example script now defaults to selective; delta block commented out)
- README placeholder for the density plot under the Selective mode results
  section, with a note explaining step 0 is omitted (always base-sync = 1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unify modes

The compute step now returns (payload, mask) per param via a new PartialPayload
dataclass; the encoder reads the mask directly instead of re-deriving it from
a sentinel. NaN no longer materialized on the sender in the common sparse
paths — only inside _encode_dense for selective (debug-only encoding).

- New PartialPayload(name, payload, mask) carrying per-param compute output.
- compute_delta + compute_selective merged into one compute_payload(mode); the
  per-mode logic lives in a closure dispatched once before the shared loop.
- _encode_sparse reads (pp.payload, pp.mask) directly; no predicate threading.
- _make_indices_kv / _make_bitmask_kv flattened to plain _indices_kv / _bitmask_kv
  (factories were vestigial once is_active capture was removed).
- Dense encoding extracted into _encode_dense; lazily NaN-marks selective tensors
  for the receiver-side sentinel — the only sender path that still touches NaN.

Net: selective saves one full-size tensor allocation per param on the sender
(no more torch.where(mask, tensor, NaN) materialization in the common case).
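
The shape of the refactor (the dataclass fields and compute_payload come from this commit; the closure bodies are illustrative):

```python
from dataclasses import dataclass
import torch

@dataclass
class PartialPayload:
    name: str
    payload: torch.Tensor   # changed values (selective) or deltas (delta)
    mask: torch.Tensor      # bool mask the encoder reads directly; no sentinel

def compute_payload(mode, delta_dtype=torch.float32):
    # per-mode logic dispatched once, before the shared per-param loop
    if mode == "selective":
        def compute(name, cur, snap):
            mask = cur != snap
            return PartialPayload(name, cur[mask], mask)
    else:
        def compute(name, cur, snap):
            mask = cur != snap
            delta = (cur.to(delta_dtype) - snap.to(delta_dtype))[mask]
            return PartialPayload(name, delta, mask)
    return compute
```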

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch from a99a867 to 4245bfa on May 14, 2026 08:18
